IS608 Project -- James Quacinella

Goal

This is the final project submission for IS608, a required course in the Master's Degree in Data Analytics program at SPS CUNY. A list of projects from my fellow classmates can be found here. All the code and data is hosted on Github. Checkout the final visualization, if you want to skip over the explanation below.

The goal of this project is to create choropleth map of energy consumption on a per-county level. Data from the EIA gives data on utility level demand and what counties utilities deliver power to. We use data, plus Census data on population, to come up with an estimate of per-county energy consumption. This methodology will be explained below.

Data Sources

EIA Form 618 - This data folder has many Excel files, two of which are valuable
- Service Territories maps each County to all the utility IDs that serves it
- Retail data maps the amount of energy produced by each utlitity
Annual Estimates of the Resident Population: April 1, 2010 to July 1, 2013 from U.S. Census Bureau, Population Division
- Actually downloaded from FactFinder

High Level Code Overview

Here is some pseudo-code decribing the whole process, from raw data to final map output

Build a mapping from county ID to its 2012 population (2012 being latest data from the EIA)
Build a mapping from utility ID to the total energy output of that utility
Build a mapping from utility ID to list of county IDs is serves
Derive a mapping from utility ID to the sum of county populations of all the counties it serves
For each county
- For each utility that serves this county, calculate an estimate of the county's consumption based of the ratio of the county's population to the total population the utility serves.
Output the final county ID to consumption estimate to CSV. This will be served via a webserver to D3

Issues with Data

The EIA data maps the Utility to a County by name, which did not eactly match the names of counties in the census data. This required some manual cleanup, which are being handled by the initial load data function.
The EIA data does not have more granular data about how much each county consumes of its total energy output. A methodlolgy needed to be developed to approximate this data. I have contacted the EIA for comment, with no response yet.
- Update: EIA responded and confirmed my suspicions.
Not all utilities have retail data. When calculating how much energy was consumed by a county, if we come across a utility we do not have data on, we assume 0. It might be better to take an average of all other utilities as a better approximation of any missing utility data.
- Update: While there are still utilities that do not have data, I realized that another file in the data set gives us some of this data. The short form is another way for utilities to send in their data to the EIA in a condensed way. This is kept in another file. The code below has been updated to read from it and use that data as well.
  - At this point, there are 358 counties that have some missing data, out of a total of 3129. For example, a county may have 4 utilities, but we have data only one three. We still come up with an estimate, but this estimate is low-balling the total consumption.

Methodology for Estimating Per-County Consumption

Since the data from the EIA is organized such that utilties only have to report their total numbers, we do not have direct "county to consumption" numbers. This needs to be estimated from what we have, and here is my attempt at doing so.

Given a certain county, there will be an associated number of utilities that provide energy to it. Conversely, we also have for a given utility, a list of counties it provides energy for. From the census data, we can then get an idea of how many people a utility servers by summing the population across counties it serves. This is an upper estimation, since the utility more than likely does not serve 100% of a county (due to competition, etc).

What we have from data is the total amount of consumption per utility, and now we also have a total population that the utility potentially serves. Given this, as an estimation, we can say that a given county consumes energy based on a population ratio of the county's population to the total amount of people served by the utility.

The final estimation of a county's usage is to sum these contributions across all utilities a given county is served by.

Improvements to the Methodology

Any county level data from the utilities themselves would have been best. Even just the total number of customers it serves per county woul have made things more accurate
For missing utility data, come up with a better estimate than our current estimate of 0

Code



In [6]:

    
### Globals and Constants

import csv
from collections import defaultdict

# Column indicies for EIA data
UTILITY = 1
STATE = 3
COUNTY = 4

# Column indicies for EIA retail data
UTILITY_ID = 1
CONSUMPTION = 21
SHORT_CONSUMPTION = 5

# Column indicies for Census data
ID = 1
DESC = 2
POP2012 = 7

# Constants
YEAR = 2012
PATH_TO_SERVICE_DATA = "data/f8612012/service_territory_%s.csv" % YEAR  # Path to the defined utility service territories
PATH_TO_RETAIL_DATA = "data/f8612012/retail_sales_%s.csv" % YEAR  # Path to the defined utility service territories
PATH_TO_SHORTRETAIL_DATA = "data/f8612012/short_form_%s_Changed.csv" % YEAR  # Path to the defined utility service territories
PATH_TO_POPULATION_DATA = "data/PEP_2013_PEPANNRES/PEP_2013_PEPANNRES_with_ann_with_changes.csv"
STATES = { 'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AS': 'American Samoa','AZ': 'Arizona','CA': 'California','CO': 'Colorado','CT': 'Connecticut','DC': 'District of Columbia','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','GU': 'Guam','HI': 'Hawaii','IA': 'Iowa','ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana','MA': 'Massachusetts','MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota','MO': 'Missouri','MP': 'Northern Mariana Islands','MS': 'Mississippi','MT': 'Montana','NA': 'National','NC': 'North Carolina','ND': 'North Dakota','NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey','NM': 'New Mexico','NV': 'Nevada','NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania','PR': 'Puerto Rico','RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee','TX': 'Texas','UT': 'Utah','VA': 'Virginia','VI': 'Virgin Islands','VT': 'Vermont','WA': 'Washington','WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'}



# Global variables
countyToUtility = {}        # Mapping from county number to a list of utilities serving it
utilityToCounty = {}        # Mapping from utility id to a list of counties it serves
countyToPopulation = {}     # Mapping from county number to a population from census data
nameToID = {}               # Mapping from county name to the county code
utilityToConsumption = dict()   # Mapping the utility id to the total consumption in mWh
utilityToPopulation = dict()    # Mapping the utility id to the total county population it serves
countyToConsumption = dict()    # Mapping the final result of county ID to consumption in mWh

Load the Data

This function reads in the census data, and loads a dictionary that maps the county name to its county ID (nameToID), and another dictionary that maps from the county ID to the population estimate in 2012 (countyToPopulation). The county name is constructed as "state_countyname", all lowercase



In [7]:

    
def loadCensusData():
    ''' Loads the mapping from county number to population into 
    'countyToUPopulation' and a name to ID mapping from the PATH_TO_POPULATION_DATA file. '''
    f = open(PATH_TO_POPULATION_DATA)
    reader = csv.reader(f)
    reader.next()
    reader.next()
    for row in reader:
        # Grab the important parts of the data
        id = int(row[ID])
        desc = row[DESC]
        pop2012 = row[POP2012]

        # Derive the 'county key' from the description column
        (county, state) = desc.split(',')
        county = county.lower().replace("county", "").replace(".", "").replace(" ", "")
        state = state.lower().replace(' ', '')
        key = state + '_' + county

        # correction to Lousiana county names
        if state == "louisiana":
            key = key.replace("parish", "")

        # Setup the two mappings
        nameToID[key] = id
        countyToPopulation[id] = float(pop2012)


def loadCountytoUtilityData():
    ''' Loads the mapping from county number to utilities into 
    'countyToUtility' from the PATH_TO_SERVICE_DATA file. '''
    f = open(PATH_TO_SERVICE_DATA)
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        state = STATES[ row[STATE].upper() ].lower().replace(' ', '')
        county = row[COUNTY].lower().replace(' ', '').replace('.', '')
        key = state + '_' + county
        utilityID = int(row[UTILITY])

        try:
            if nameToID[key] in countyToUtility:
                countyToUtility[nameToID[key]].add(utilityID)
            else:
                #if key not in nameToID:
                #    print "key %s not found" % key
                countyToUtility[nameToID[key]] = set([utilityID])

            if utilityID in utilityToCounty:
                utilityToCounty[utilityID].add(nameToID[key])
            else:
                utilityToCounty[utilityID] = set([nameToID[key]])
        except Exception as e:
            pass
            #print "Exception: %s" % key


def loadUtilityConsumptionData():
    ''' Load the retail data as  mapping from utility ID 
    to total level of consumption in mWh. '''
    # Open file as CSV and skip three header lines
    f = open(PATH_TO_RETAIL_DATA)
    reader = csv.reader(f)
    reader.next()
    reader.next()
    reader.next()

    # For each row in the reader ...
    for row in reader:
        # Read in the ID and consumption total level
        id = int(row[UTILITY_ID])
        consumption = float(row[CONSUMPTION].replace(',', ''))

        # Update global mapping
        utilityToConsumption[id] = consumption


    f = open(PATH_TO_SHORTRETAIL_DATA)
    reader = csv.reader(f)
    reader.next()

    # For each row in the reader ...
    for row in reader:
        # Read in the ID and consumption total level
        id = int(row[UTILITY_ID])
        consumption = float(row[SHORT_CONSUMPTION].replace(',', ''))

        # Update global mapping
        if id in utilityToConsumption:
            print "Whoa, we have a duplicate utility id: %d" % id

        utilityToConsumption[id] = consumption


def deriveUtilityPopulation():
    ''' Derive the total number of people in all counties served by a utility. '''

    # Loop through all utility companies and the list of counties served by them
    for utility, counties in utilityToCounty.items():
        # Calculate total population for all counties
        totPopulation = 0
        for county in counties:
            totPopulation += countyToPopulation[county]

        # Save total population in mapping
        utilityToPopulation[utility] = float(totPopulation)



def calculateCountyConsumption():
    ''' Apply our methodology and generate a mapping of county to total energy consumption in mWh. '''
    countiesWithErrors = 0
    for county, utilities in countyToUtility.items():
        # Since we are using defaultdicts, if we try to access data on a utility we do not have, 
        # it'll be counted as 0. See 'Issues' section above.
        #countyToConsumption[county] = sum([((countyToPopulation[county] / utilityToPopulation[utility]) * 
        #                                   utilityToConsumption[utility]) for utility in utilities])
        countySum = 0
        errorCount = 0

        for utility in utilities:
            try:
                countySum += ((countyToPopulation[county] / utilityToPopulation[utility]) * utilityToConsumption[utility])
            except Exception as e:
                errorCount+=1

                #print "Error for county %d and utility %d" % (county, utility)

        if errorCount != 0:
            countiesWithErrors+=1
            #print "Number of errors: %d / %d" % (errorCount, len(utilities))

        countyToConsumption[county] = countySum

    #print "Counties with errors: %d / %d" % (countiesWithErrors, len(countyToUtility.keys()))



# Start by loading the census data and the service territory mapping
loadCensusData()
loadCountytoUtilityData()
loadUtilityConsumptionData()
deriveUtilityPopulation()
calculateCountyConsumption()

Examine the data



In [8]:

    
# Print random sample of county name to list of utility IDs
for county, utilityList in countyToUtility.items()[0:10]:
  print "%s is served by utility IDs %s" % (county, list(utilityList))









    



15003 is served by utility IDs [19547]
54065 is served by utility IDs [15263]
42009 is served by utility IDs [20387, 40167, 13205, 14711, 40221, 40222]
30007 is served by utility IDs [12825, 23586]
30043 is served by utility IDs [12825, 23586]
9003 is served by utility IDs [4176, 6207]
41003 is served by utility IDs [14354, 40437, 4743]
41005 is served by utility IDs [15248, 2955]
41007 is served by utility IDs [20385, 14354, 18917, 28541]
13095 is served by utility IDs [18305, 12706, 7140, 230]



In [9]:

    
# Print random sample of county name to list of utility IDs
for name, countyID in nameToID.items()[0:10]:
  print "%s (%s) had a population of %s" % (name, countyID, countyToPopulation[countyID])









    



kansas_elk (20049) had a population of 2674.0
alabama_wilcox (1131) had a population of 11406.0
florida_washington (12133) had a population of 24854.0
newmexico_mckinley (35031) had a population of 72726.0
southcarolina_cherokee (45021) had a population of 55760.0
texas_shackelford (48417) had a population of 3368.0
westvirginia_grant (54023) had a population of 11814.0
georgia_candler (13043) had a population of 11107.0
illinois_pope (17151) had a population of 4271.0
missouri_marion (29127) had a population of 28818.0



In [10]:

    
# Print random sample of utility ID to consumption data
for utilityID, consumption in utilityToConsumption.items()[0:10]:
    if consumption != 0:
        print "Utility ID %s in 2012 had %s mWhs consumed" % (utilityID, consumption)









    



Utility ID 8192 in 2012 had 83697.0 mWhs consumed
Utility ID 8198 in 2012 had 696574.0 mWhs consumed
Utility ID 8199 in 2012 had 102237.0 mWhs consumed
Utility ID 57354 in 2012 had 2060.0 mWhs consumed
Utility ID 8205 in 2012 had 43237.0 mWhs consumed
Utility ID 24590 in 2012 had 423362.0 mWhs consumed
Utility ID 8209 in 2012 had 28637.0 mWhs consumed
Utility ID 8210 in 2012 had 535614.0 mWhs consumed
Utility ID 8212 in 2012 had 307219.0 mWhs consumed
Utility ID 15022 in 2012 had 37062.0 mWhs consumed

Mid Term Visualizations

Please see:

graph.html for a D3 visualization of the utilities and counties as a mathematical graph (only a subset)
- Inspired by http://bl.ocks.org/d3noob/5141278
barchart.html for a Google Chart visualization of utility production as a bar graph
- Inspired by https://google-developers.appspot.com/chart/interactive/docs/gallery/columnchart

Final Visualizations

Using the data we have and our estimations of county energy consumption, we can plot a map of the united states showing which counties consume the most energy.



In [6]:

    
from IPython.display import HTML
HTML('<iframe src=http://jquacinella.github.io/IS608_Project/map.html width=100% height=750></iframe>')
#HTML('<iframe src="file:///home/james/Code/Masters/IS608/Project/map.html" width=100% height=1000></iframe>')









    Out[6]:

Lets also check out this map when we normalize by population. This will give us a view into avergae individual consumption by county:



In [7]:

    
from IPython.display import HTML
HTML('<iframe src=http://jquacinella.github.io/IS608_Project/map_norm.html width=100% height=750></iframe>')
#HTML('<iframe src="file:///home/james/Code/Masters/IS608/Project/map.html" width=100% height=1000></iframe>')









    Out[7]:

Difficulties

The major difficulty in generating the above maps was determing a good color scale and the appropriate range to use for the legend. The consumption, as can be seen in the google charts example, the underlying distribution is skewed and has a long tail. Making a linear range will inevitably leave things out. Possibly using a log-scale would help, but I am not sure if that works well with coloring. Right now, many of the extreme outliers are in the same grouping, which allows us to see more detail in the map.

Results and Discussions

The above maps show an interesting pattern: when comparing the maps, we see a distinct difference: big cities, especially on the coasts have bigger total consumption (which is expected with higher populations); the middle of the country have lower total consumption but have higher per-capita consumption. That means that cities seem to be more effcient with regards to how much any given person consumes electricity. This could be due to many things:

Higher prices leading to conservation
Better government support for conservation efforts
Smaller real-estate, which is easier to cool in the summer

Further investigation would be needed to figure out why much of the middle of the country has such higher individual usage.

Improvements to Visualizations

As mentioned above, improvements can be made to our estimation methodology
Continue this work for more than just 2012
- Potentially create a video of how this map changes over time



In [ ]: